Abstract


A major contribution of this thesis has been in the design and implementation
of a Viterbi decoder. A Viterbi decoder is a very important block in any CDMA
modem. The aim was to design a 19.2 kbps, 256-state Viterbi decoder with the
added capability of catering to higher input data rates. To the best of our
knowledge, none of the existing literature discusses Viterbi decoder
implementation based on high-level synthesis targeted at field-programmable
gate arrays (FPGAs). This has been the focus of the present thesis.

Besides the above, issues such as the organization of the path memory and the
decision memory, the decision memory reading techniques, and the clocking
mechanism have been discussed. We have also retained the bit-synchronization
information, even though this made normalization of the path metrics essential.
We have used very fast subtracters to implement the normalization of the path
metrics.

In this thesis we explore an implementation methodology for rapid prototyping
using FPGAs. In designing the Viterbi decoder we have used some of the features
discussed in the literature; however, we have attempted to compare their merits
with other existing techniques and to identify the advantages that have accrued
because of their use. It has not been our attempt to give just another
implementation of the Viterbi algorithm, but to create design guidelines which
may be used in any future implementation of the decoder.

Chapter 1


Design of a CDMA Modem - 
Some Implementation Issues 

1.1 Introduction 

In recent years there has been tremendous growth in a new technology for
multiple access communication, namely code division multiple access (CDMA).
This is in addition to the already existing technologies, such as frequency
division multiple access (FDMA) and time division multiple access (TDMA).
Though no technology is well suited for all situations, CDMA has some features
which make it attractive for multiuser communications. In this chapter we
compare the above three technologies with respect to their ability to address
various issues such as capacity. We also discuss some features which are
special to the CDMA technology.

1.2 Basics of CDMA 

While TDMA and FDMA achieve user discrimination by time or frequency
separability of the channels, CDMA assigns each user a signature sequence for
separability. In CDMA each signal consists of a different pseudorandom binary
sequence that modulates the carrier, thereby spreading the spectrum of the
waveform. A large number of CDMA signals share the same frequency spectrum, and
there is no orthogonality such as that created by time division. When viewed in
the time or the frequency domain, the CDMA signals lie on top of one another.
The signals are separated by a correlating receiver which collapses only the
energy of the selected binary sequence. The other user signals just represent
interference to the pseudorandom tuned receiver [1].
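
The separation by correlation described above can be illustrated with a small
sketch. It is not taken from the thesis: the function names, the signature
length of 64 chips, and the use of seeded random +/-1 sequences in place of real
PN codes are all assumptions made for illustration, and the channel is taken to
be noiseless and chip-synchronous.

```python
import random


def make_signature(length, seed):
    """Hypothetical signature: a seeded pseudorandom +/-1 chip sequence."""
    rng = random.Random(seed)
    return [rng.choice((-1, 1)) for _ in range(length)]


def spread(bits, signature):
    """Multiply each data bit (mapped to +/-1) by the whole signature."""
    chips = []
    for bit in bits:
        symbol = 1 if bit else -1
        chips.extend(symbol * c for c in signature)
    return chips


def despread(channel, signature):
    """Correlate each symbol interval with one signature and take the sign."""
    n = len(signature)
    bits = []
    for start in range(0, len(channel), n):
        block = channel[start:start + n]
        corr = sum(x * c for x, c in zip(block, signature))
        bits.append(1 if corr > 0 else 0)
    return bits


if __name__ == "__main__":
    sig_a, sig_b = make_signature(64, 1), make_signature(64, 2)
    bits_a, bits_b = [1, 0, 1, 1], [0, 0, 1, 0]
    # Both users occupy the same band at the same time: the signals add.
    channel = [a + b for a, b in zip(spread(bits_a, sig_a), spread(bits_b, sig_b))]
    print(despread(channel, sig_a), despread(channel, sig_b))
```

The correlator output for the selected user is the full signature energy plus
a smaller cross-correlation term from the other user, which is the
"interference" referred to in the text.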

1.3 Some Different CDMA Systems

1.3.1 Direct Sequence Cellular CDMA (DS/CDMA)

In Direct Sequence Cellular CDMA (DS/CDMA), usually one frequency band is used
for the base-to-mobile link (forward link) and a separate frequency band is
used for the mobile-to-base link (reverse link).

The two links, forward and reverse, differ from each other in some respects.
On the forward link a common pilot is used for channel estimation and time
synchronization, and the users can be orthogonalized. The reverse link, on the
other hand, does not enjoy these features. It cannot be orthogonalized in view
of the different locations as well as the independent movements of the mobiles
(Appendix A).

Characteristics of DS/CDMA 

• Universal Frequency Reuse: A DS/CDMA cellular system can apply a universal
one-cell frequency reuse pattern. If the traffic requirement at a certain
location increases, the introduction of a new cell will be less restricted than
in the case of either FDMA or TDMA.

• Power Control 

1. Reverse Link: The reverse link is designed to be asynchronous and hence is
besieged by the "near-far" problem^. A solution to the near-far problem is the
use of power control, which attempts to ensure that all signals from the
mobiles within a given cell arrive at the cell site with equal power. The power
control required must be accurate (typically within 1 dB), fast enough to
compensate for Rayleigh fading, and have a large dynamic range (80 dB) [2].

^ The near-far problem exists when a transmitter of higher strength drowns out
another transmitter of lesser strength.

2. Forward Link: The forward link does not suffer from the "near-far" problem.
Power control is applied to increase power to mobiles suffering from excessive
inter-cell interference. The requirements of dynamic range and speed are not as
stringent as for the reverse link.

• Soft Handoff and Space Diversity: The user's information is sent via two or
more base stations and is diversity combined by the user's receiver. Power
control of the mobile is coordinated by the base station that receives the
strongest signal. The soft handoff [3] provides a "make before break" handoff
transition.

• Coding: Coding redundancy can be regarded as part of the spreading. Code
rates of 1/2 and 1/3 are typically used.

• Voice Activity: To reduce multiple access interference (MAI), transmission
is stopped when voice or data activity is absent.

• Antenna Gain: Fixed sectored antennas and phased arrays accomplish a
reduction of MAI, and hence increase the user capacity. Typically a
three-sectored antenna is used.

Current DS designs 

We will consider the IS-95 standard. It employs BPSK data and QPSK spreading
on the forward link, in conjunction with synchronous intracell transmission.
The chip rate^ is 1.228 MHz, and forward error correction is employed with a
K=9, rate 1/2 convolutional code. The use of a spread pilot allows coherent
detection to take place.

On the reverse link, 64-ary binary orthogonal signalling is used in
conjunction with a K=9, rate 1/3 convolutional code.

^ The chip rate is equivalent to the code generator clock rate or, in frequency
hopping systems, the hop rate.

1.3.2 Frequency Hopping Cellular CDMA

Almost all Frequency Hopping (FH) systems are slow frequency hopping (SFH)
systems, i.e. multiple bits are transmitted on each hop (Appendix A).

SFH was proposed for cellular systems in [4, 5], and the use of such systems
has been discussed in [6]. The main difference in performance between DS and FH
is that an FH system can be approximately orthogonalized, thus limiting the
user interference.

The potential advantages of this type of FH are:

• Less total interference 

• No near far problem 

• Better performance during jamming 

• No need for guard bands at the fringes of the spectrum 

FH in commercial applications

FH [7] has been widely used in most military communications systems for the
twin reasons of fast synchronization and the absence of the near-far problem.

In commercial applications of a DS system, the codes can be made shorter, so
synchronization time is not constrained as it is in military communications.
Further, synchronization techniques that are not practical in military systems
are possible in commercial applications.

In a consumer application with full-duplex signals, the probability of
collision with an FH system is twice as high as in a military system with a
greater number of channels and a smaller number of half-duplex users. The
quality of service provided by an FH system is not as good as that provided by
a DS system, and hence DS systems have outperformed FH systems in
commercial/consumer applications.

1.4 Capacity of CDMA


As recently as 1985 [8], a straightforward comparison of the capacity of CDMA
with that of TDMA or FDMA was in favor of TDMA and FDMA. However, unlike TDMA
or FDMA, capacity in CDMA is interference limited, and any reduction in
interference leads directly to a linear increase in capacity. Factors such as
voice activity and spatial isolation have rendered the capacity of CDMA at
least double that of FDMA or TDMA under similar assumptions for a mobile
satellite application.

Due to the nonorthogonality of the CDMA waveforms, the capacity of an isolated
cell in CDMA is less than that in either FDMA or TDMA; it is in the
multicellular environment that the comparison is reversed.

Unlike TDMA or FDMA, DS spread waveforms can be used to reject multipath
returns, or to enhance overall performance by diversity combining multipath
returns in a RAKE receiver^.

Another consideration in favor of CDMA is the frequency reuse factor. For FDMA
or TDMA, frequencies used in one cell are not used in adjacent cells. For
example, for the analog AMPS system [2], a frequency reuse of one in seven is
employed. However, with CDMA a frequency reuse of one in one is possible (a
frequency reuse of two in three has been used in [1]).

CDMA also provides a natural way to exploit the bursty nature of a source for
added capacity. Using a voice activity of 1/2, in principle the capacity of
CDMA can be doubled.
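
The doubling can be seen from a rough interference budget. The following is
only a sketch under simplifying assumptions that the text does not state:
every user is received with the same power $S$, a user can tolerate at most a
fixed interference level $I_{\max}$, and each interferer is active a fraction
$\alpha$ of the time (the symbols $S$, $I_{\max}$ and $\alpha$ are introduced
here for illustration). With $N$ users in the band, the average interference
seen by any one of them is

$$ I = \alpha \,(N-1)\, S \le I_{\max} \quad\Longrightarrow\quad N \le 1 + \frac{I_{\max}}{\alpha S} . $$

For $\alpha = 1$ the bound is $1 + I_{\max}/S$, while for $\alpha = 1/2$ it is
$1 + 2 I_{\max}/S$, so for large $N$ the number of supportable users is
approximately doubled.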


^ In a RAKE receiver, signals arriving via different path delays are recombined
for correlation.

1.5 CDMA Versus FDMA

The four major factors [9] that have reversed the comparison of capacity in
favor of CDMA are:

• Voice activity 

• Spatial discrimination provided by satellite multibeam antennas

• Cross-polarization frequency reuse

• Discrimination between multiple satellites providing co-coverage

The voice activity greatly reduces the self-noise of the spread spectrum system
and utilizes the satellite downlink power more efficiently.

In FDMA, the voice activity factor does not increase the capacity when the
system is bandwidth limited, but only reduces the satellite downlink power when
the satellite is in a power-limited mode. FDMA systems are unable to exploit
the voice activity factor to improve the capacity of bandwidth-limited
mobile-to-hub links because of the delays inherent in a synchronous orbit
satellite.

The capacity of a CDMA system is further improved by multibeam antennas, which
allow a degree of frequency reuse. Though an FDMA system can also increase
capacity by employing such antennas, the increase in capacity from frequency
reuse is more substantial in the case of CDMA. Frequency reuse is greater for
CDMA because the antennas can be spaced by, say, B degrees and the entire
frequency band reused every B degrees. With a large number of beams, CDMA can
reuse the entire frequency band in each antenna beam. This is a factor of three
superior to that obtainable with FDMA.

CDMA can use the entire frequency band by utilizing the opposite senses of
cross polarization. Polarization isolation cannot usually be exploited by an
FDMA system employing small mobile antennas, because such antennas provide only
limited cross-polarization isolation, usually less than the co-channel
interference ratio (C/I) required by FDMA.

A further possibility for increasing the channel capacity in CDMA is to use
multiple satellites and to coherently combine the signals transmitted between
a mobile terminal and all satellites in view.

This technique is not available for FDMA. Also, an FDMA system operating in
the bandwidth-limited regime cannot increase its capacity by having increased
downlink power available. Thus an FDMA system operating in the
bandwidth-limited mode does not benefit from additional satellites as a way of
increasing capacity unless every mobile terminal is equipped with a costly
directive antenna capable of providing enough sidelobe rejection to give
adequate C/I performance with respect to the adjacent satellite.


1.6 Trellis Coded And Convolutionally Coded
Spread Spectrum Multi-Access

In [10] the results indicate that convolutional codes provide performance
superior to that of a trellis code [11] applied to the data symbols. It has
been observed that the gain of the convolutional code improves as lower rate
convolutional codes are used. In a band-limited environment the application of
low rate codes would penalize the bandwidth, but for spread spectrum (SS)
applications low rate convolutional codes can be applied without increasing
the bandwidth or decreasing the processing gain [3]. The advantage of low rate
convolutional codes over trellis codes is their greater distance property, and
it has been observed that the additional coding gain accruing from the greater
distance properties of lower rate convolutional codes outweighs any other
factor against them.

1.7 Orthogonal Code Techniques For CDMA


In [12] two realistic (quasi-)orthogonal techniques have been compared. The
comparison was done using single-user matched filter receivers.
The two systems considered were:

1. Walsh-Hadamard functions based orthogonal DS/SS CDMA system

2. Gold functions based quasi-orthogonal DS/SS CDMA system

1.7.1 Walsh-Hadamard DS/SS system

This particular system has been patented by Qualcomm Inc. A carrier is
assigned a Walsh-Hadamard sequence [13] (M=64) and then the data is divided
into identical I and Q components. These components are overlaid with long PN
sequences (period L=32768). This overlay protects the WH spread signal from
possible asynchronous interference due to multipath distortion or adjacent
cell interference. It also aids in the code acquisition process.

1.7.2 Gold Functions based DS/SS system

This system was patented by the European Space Agency (ESA). The main
difference from the previous system is that a single spreading sequence is
used for both code acquisition and user orthogonalization. QPSK signaling with
independent I and Q spreading by means of two different Gold sequences is
employed. The main advantage of this signaling technique is the reduced
bandwidth for a given bit rate and spreading factor. The small residual
self-noise due to the partial orthogonality of the Gold sequences [14] is
negligible as far as bit error rate (BER) and overall system efficiency are
concerned.

It has been observed in [12] that there is no clear winner in this comparison.
A minimal advantage of the WH functions based system in a frequency selective
fading channel is somewhat counterbalanced by the decreased sensitivity of the
Gold functions based system to implementation losses and nonlinear
distortions. No substantial difference is observed as far as overall bandwidth
efficiency and carrier phase offset sensitivity are concerned. Further, it has
been suggested that system level issues should be taken into consideration for
an in-depth comparison. For instance, initial code acquisition is presumably
slower in the WH functions based system, but other high level functions such as
soft handover or diversity reception may be easier in the WH functions based
system.





Chapter 2 


CDMA Mobile Station Modem 
ASIC 


2.1 Introduction



Figure 2.1 CDMA modem block diagram

The success of any satellite-based communication system depends upon low-cost,
small-size and low-power-consumption user terminals. This basic requirement
calls for state-of-the-art modem architecture, design and implementation. With
the advent of portable cellular telephones, a need to fit the maximum amount
of circuitry in one integrated circuit has been felt. The continual
improvement of application specific integrated circuits (ASICs) makes a
single-chip implementation of the complete CDMA modem possible. The design of
a single-chip CDMA modem essentially calls for a high degree of cooperation
between digital signal processing specialists, for the architectural and
algorithmic design, and VLSI design experts, for the design of the chip
layout.

The CDMA system essentially consists of a base station system and a mobile
station system. Full duplex communication is possible via a forward link (base
station to mobile) and a reverse link (mobile to base station). Fig. 2.1 shows
the block diagram of a CDMA modem with the various stages involved in the
forward link (demodulation) and the reverse link (modulation).

For the forward link in the mobile, digital processing begins when data flows
from the A/D converter to the demodulator, where the received data are
demodulated and multipath combination is performed. The data then pass to the
deinterleaver, which reestablishes the original data ordering, and finally to
the Viterbi decoder, where the symbols are soft decoded and error corrected.

For the reverse link in the mobile, voice data are encoded in the vocoder and
sent to the modulator, where the data are convolutionally encoded, interleaved
to counteract correlated deep fades, scrambled, direct sequence spread, and
FIR filtered. These data are then finally sent to a D/A converter and are
subsequently RF modulated and transmitted.


2.2 CDMA Modem Architecture

2.2.1 Forward-Link demodulation

The forward link demodulation system shown in Fig. 2.1 has been redrawn in
Fig. 2.2. The figure shows the various stages included in the forward
demodulation path of a CDMA modem: a searcher, demodulating fingers, a symbol
combiner, a frequency error combiner, power control, a deinterleaver, and a
Viterbi decoder.



Figure 2.2 Forward link demodulation

The searcher allows the modem to detect and lock onto a pilot signal on the
forward link and to continually search for other pilots (from other base
stations) that may have better signal levels.

Demodulation is performed using a three-finger RAKE-type receiver. Each of the
fingers is capable of demodulating a component of a multipath signal, tracking
a single pilot by itself, and maintaining its own timing reference independent
of the other fingers. At the output of each finger is a time deskew buffer
that allows it to time track an average of four symbols away from the other
fingers. The symbols from each finger are realigned and then combined in the
symbol combiner to form soft decision data.

The frequency error combiner combines the error signals from each finger and
uses the resulting measure to bias the frequency of the local oscillator.

The mobile station modem uses two methods to control transmit power: open loop
power control and closed loop power control. With the first method, the modem
attempts to estimate the path loss on the forward link and, based on these
measurements, estimates the power required to transmit on the reverse link.
With the closed loop method, reverse link power control information is sent
in-line, at specified intervals, with the normal symbols being transmitted on
the forward link. Though this method is more accurate, it is not suitable for
rapid adjustment of the mobile transmit power.

The resulting symbol stream is passed on to a block deinterleaver where the
soft decisions are deinterleaved. The deinterleaver spans either 20 or 80 ms,
depending on the rate of information being transmitted on the forward link.

The deinterleaved symbols are passed on to a Viterbi decoder for an optimal
decoding of the symbol stream. Viterbi decoding, because of its wide
application in the communication field and its high hardware complexity, will
be our focus in this thesis.

2.2.2 Reverse-Link modulation



Figure 2.3 Reverse link modulation


The time tracking loop in the symbol combiner is responsible for maintaining
reverse link timing, which is synchronized with forward link timing. The
symbol combiner triggers the modulation of each frame of data to be
transmitted on the reverse link. Prior to each frame boundary a packet of data
is written into the modem. Modulation of these data includes encoding,
interleaving, scrambling, spreading, and filtering (refer Fig. 2.3). For each
20 ms frame a data packet is transmitted in bursts into the input buffer of a
convolutional encoder of rate 1/3 and constraint length K=9.

The code symbols are passed through a block interleaver which spans a 20 ms
frame. The interleaving algorithm forms an array with 32 rows and 18 columns.
At 9600 bps, the interleaver forms a 32 x 18 matrix as in Table 2.1.

Table 2.1 Interleaving algorithm


At 9600 bps the transmission sequence is to send row by row in sequential
order up to row 32. At 4800 bps the transmission sequence is to send by the
unique order of rows as follows:

J, J+2, J+1, J+3
for J = 1+4i and i = 0, 1, 2, 3, ..., (32/4 - 1)

At 2400 bps the transmission sequence is by a unique order of rows as follows:

J, J+4, J+1, J+5, J+2, J+6, J+3, J+7
for J = 1+8i and i = 0, 1, 2, ..., (32/8 - 1)

At 1200 bps:

J, J+8, J+1, J+9, J+2, J+10, J+3, J+11, J+4,
J+12, J+5, J+13, J+6, J+14, J+7, J+15
for J = 1+16i and i = 0, 1
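
The row orders listed above can be generated mechanically. The sketch below is
only an illustration of the pattern as reconstructed in the text, not code from
the modem; the function name `row_order` and the table of offsets are
assumptions introduced here.

```python
# Offsets (relative to J) in which rows are sent within one group of rows,
# for each data rate; J = 1 + group_size * i as in the text.
ROW_OFFSETS = {
    9600: [0],
    4800: [0, 2, 1, 3],
    2400: [0, 4, 1, 5, 2, 6, 3, 7],
    1200: [0, 8, 1, 9, 2, 10, 3, 11, 4, 12, 5, 13, 6, 14, 7, 15],
}


def row_order(rate_bps, rows=32):
    """Return the 1-based order in which interleaver rows are transmitted."""
    offsets = ROW_OFFSETS[rate_bps]
    group = len(offsets)
    order = []
    for i in range(rows // group):
        j = 1 + group * i          # first row of the i-th group
        order.extend(j + off for off in offsets)
    return order


if __name__ == "__main__":
    for rate in (9600, 4800, 2400, 1200):
        print(rate, row_order(rate))
```

At every rate the result is a permutation of rows 1 to 32, so each row is
still read exactly once; only the order changes.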

The interleaved symbol stream is then converted into a set of orthogonal
functions. Each group of six interleaved code symbols is mapped to a 64-bit
Walsh function.
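
A minimal sketch of this mapping follows. It assumes the Walsh functions are
the rows of a 64 x 64 Sylvester-type Hadamard matrix written with bits 0/1,
and that the first of the six symbols is the most significant bit of the row
index; both the bit ordering and the function names are assumptions made for
illustration and are not stated in the text.

```python
def symbols_to_index(symbols):
    """Pack six binary code symbols into a row index 0..63 (first symbol = MSB, assumed)."""
    index = 0
    for s in symbols:
        index = (index << 1) | s
    return index


def walsh_row(index, order=64):
    """Row `index` of the Sylvester-Hadamard matrix, as bits.

    Chip j equals the parity of the bitwise AND of index and j.
    """
    return [bin(index & j).count("1") & 1 for j in range(order)]


def walsh_modulate(symbols):
    """Map each group of six code symbols to 64 Walsh chips."""
    chips = []
    for start in range(0, len(symbols), 6):
        chips.extend(walsh_row(symbols_to_index(symbols[start:start + 6])))
    return chips


if __name__ == "__main__":
    print(symbols_to_index([1, 0, 0, 0, 1, 1]))
    print(walsh_row(1)[:8])
```

Any two distinct rows agree in exactly half of their 64 chips, which is the
orthogonality the receiver relies on when it correlates against all 64
candidates.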


[Figure: six interleaved code symbols (for example 100011 = 35) address a Walsh
lookup table, which outputs the corresponding Walsh chips]


This stream of Walsh chips is then scrambled by BPSK spreading it with a
pseudorandom (PN) sequence of bits produced by a 42-bit pseudonoise sequence
generator.

After the modulated symbol stream is BPSK spread, it splits to form an I-phase
and a Q-phase path for OQPSK signaling. The I- and Q-phase paths are each
spread with an independent PN sequence produced by two 15-bit PN generators
programmed with different polynomials.

Reverse link transmission power is reduced by not transmitting the redundant
symbols that are present for data rates slower than 9600 bps. Symbols are read
from the block interleaver in a fashion which collects redundant data into
groups, and a data burst randomizing algorithm uses the 42-bit PN generator to
pseudorandomly gate off the redundant data stream during transmission.

Two 48-tap finite impulse response (FIR) filters are included in the modem,
one for the I-phase data stream and one for the Q-phase data stream. These
filters attain a stopband rejection of 42 dB. The outputs of the two filters
are truncated to 8 bits before multiplexing to a common output pin.

This completes our discussion of a CDMA modem. In the next chapter we will
introduce the techniques involved in the implementation of a Viterbi decoder.
Viterbi decoding and the architecture of a Viterbi decoder will be our focus
in the remaining chapters.

Chapter 3


Implementation of Area Efficient 
Viterbi Decoders 


In this chapter we describe techniques for implementing a VLSI Viterbi
decoder. The design implementation can vary between two extreme forms: a
completely parallel approach based on a separate add-compare-select (ACS) unit
for each node, and a totally serial approach based on a single ACS unit for
all the nodes. We will examine the method employed for in-place computation of
the path metrics. Also, the two forms of memory organization, the register
exchange method and the traceback method, will be compared. Our main objective
will be to optimize our design for minimum area without sacrificing the
additional features of a Viterbi decoder, e.g. bit synchronization and bit
error rate (BER) measurement.

This chapter is devoted to the study of the basic techniques employed in the
design of a Viterbi decoder. In chapter 4 we will review some of the
state-of-the-art Viterbi decoder implementations, and in chapter 5 we will
describe our implementation of the Viterbi decoder, which is based on the
discussions in chapters 3 and 4 and best suits our specifications.

3.1 Introduction

The Viterbi algorithm employed for decoding convolutional codes was discovered
in 1967. Realized as an application of dynamic programming, the Viterbi
algorithm has been used for estimation and detection problems in digital
communications. Recently it has been used in speech recognition, where
characters are modeled as hidden Markov models. Earlier implementations of the
Viterbi decoding algorithm were based on discrete circuits. The size and
complexity of these circuits limited the application of the Viterbi algorithm
to deep space applications and to very expensive, large digital communication
systems [15].

The Viterbi algorithm may be viewed as a solution to the problem of maximum a
posteriori probability estimation of the state sequence of a finite-state
discrete-time Markov process observed in memoryless noise. A tutorial on the
Viterbi algorithm can be found in [16].

Recently a number of VLSI architectures to implement the algorithm have been
suggested. Broadly, these architectures can be divided into two categories:

• node parallel and

• node serial

The Viterbi algorithm requires computation of a path metric for all $2^{K-1}$
encoder states (K = constraint length) for each timing instant.

The constraint length K and the coding rate R of the coder-decoder are two
important parameters that determine the coding gain. Since the number of
states in the trellis equals $2^{K-1}$, the complexity of the Viterbi decoder
doubles for every unit increase in K. Although a higher value of K yields a
larger coding gain (4.8 dB for K=9), the achievable gain is limited by the
speed and complexity of the decoder.

Assuming equiprobable input data sequences, the decoder chooses the path (i.e.
the sequence x) that maximizes the log likelihood function $\log P(z \mid x)$.
The likelihood function is difficult to implement efficiently, so typically
the maximum likelihood decoder is implemented using distance measures. In a
binary symmetric channel the metric reduces to Hamming distance, while for
eight-level soft decision it reduces to Euclidean distance. Not more than
eight levels of soft decision are used, as eight levels of quantization result
in a loss of less than 0.25 dB compared to infinitely fine quantization, and
therefore quantization to more than eight levels yields little performance
improvement [15].
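
The two distance measures can be sketched as follows. This is an illustration
only: the level convention (0 = most confident "0", 7 = most confident "1"),
the input range of -1 to +1 for the quantizer, and all function names are
assumptions made here, and a hardware decoder would normally use a simplified
integer metric rather than a squared distance.

```python
def hamming_metric(received_bits, expected_bits):
    """Hard-decision branch metric: number of differing code symbols."""
    return sum(r != e for r, e in zip(received_bits, expected_bits))


def quantize(sample, levels=8):
    """Uniformly quantize a sample in [-1, +1] to an integer level 0..levels-1."""
    level = int((sample + 1.0) / 2.0 * levels)
    return max(0, min(levels - 1, level))


def soft_metric(received_levels, expected_bits, levels=8):
    """Soft-decision branch metric: squared Euclidean distance in level units."""
    top = levels - 1
    return sum((r - e * top) ** 2 for r, e in zip(received_levels, expected_bits))


if __name__ == "__main__":
    samples = [0.9, -0.2]                      # noisy version of code symbols 1, 0
    levels = [quantize(s) for s in samples]
    print(levels, soft_metric(levels, [1, 0]), soft_metric(levels, [1, 1]))
```

With hard decisions a weak symbol and a strong symbol count the same; the soft
metric keeps the reliability information that would otherwise be discarded.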



Figure 3.1 (a) K=3, R=1/2 convolutional encoder (b) Code trellis diagram

For the Viterbi algorithm, the state sequence is modeled as a Markov process
observed in memoryless noise: the probability
$P(x_{k+1} \mid x_0, x_1, \ldots, x_k)$ of being in state $x_{k+1}$ at time
k+1 depends only on the state at time k,

$$P(x_{k+1} \mid x_0, x_1, \ldots, x_k) = P(x_{k+1} \mid x_k)$$

A transition from state $x_k$ to $x_{k+1}$ is denoted by $\xi_k$, a state
sequence $(x_0, x_1, \ldots)$ by $x$, and a sequence of observations by $z$.

The Viterbi algorithm can be defined as finding the state sequence $x$ for
which the a posteriori probability $P(x \mid z)$ is maximum.

In Fig. 3.1 a description of Viterbi decoding known as a trellis has been
shown. In a trellis each node corresponds to a distinct state $x_k$ at a given
time k, and each branch represents a transition to a new state $x_{k+1}$ at
the next time instant k+1. For every possible state sequence there corresponds
a unique path through the trellis. Given a sequence of observations $z$, we
may assign a metric to every path $x$ in the trellis. The metric chosen may be
proportional to $-\ln P(x, z)$. Since $P(x, z) = P(x \mid z) P(z)$ and $P(z)$
does not depend on $x$, the path with maximum $P(x, z)$ is also the path with
maximum $P(x \mid z)$, and it is the path with the smallest metric. Hence we
may assign a metric

$$\lambda(\xi_k) = -\ln P(x_{k+1} \mid x_k) - \ln P(z_k \mid \xi_k)$$

to each transition $\xi_k$ from $x_k$ to $x_{k+1}$.

In the trellis a state sequence $(x_0, x_1, \ldots, x_k)$ corresponds to a
path starting at the node $x_0$ and terminating at $x_k$. There will be, in
general, several such path segments, each with some metric

$$\lambda(x) = \sum_{i=0}^{k-1} \lambda(\xi_i)$$

The shortest such path segment is called the survivor corresponding to the
node $x_k$ and is denoted $\hat{x}(x_k)$. The shortest complete path $\hat{x}$
begins with one of these survivors. The Viterbi algorithm finds the shortest
route in the trellis. At any time k we need remember only the survivors and
their lengths $\Gamma(x_k) = \lambda[\hat{x}(x_k)]$. To get to time k+1, we
need only extend the survivors at time k by one time unit, compute the metrics
of the extended path segments, and for each node $x_{k+1}$ select the shortest
extended path segment terminating in $x_{k+1}$ as the corresponding survivor
at time k+1.
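
The recursion just described can be written out for the small K=3, R=1/2 code
of Fig. 3.1. The sketch below is not the decoder designed in this thesis: it
assumes generator polynomials 7 and 5 (octal), hard decisions with a Hamming
branch metric, an encoder that starts in state 0, and survivors kept as whole
bit lists rather than in a hardware path memory. All names are illustrative.

```python
GENS = (0b111, 0b101)   # assumed generators for the K=3, R=1/2 example
K = 3


def branch_output(reg):
    """Code symbols for shift-register contents `reg` (newest bit is the MSB)."""
    return [bin(reg & g).count("1") & 1 for g in GENS]


def encode(bits):
    state, out = 0, []
    for u in bits:
        reg = (u << (K - 1)) | state
        out.extend(branch_output(reg))
        state = reg >> 1              # shift: the oldest bit drops out
    return out


def viterbi_decode(received):
    n_states = 1 << (K - 1)
    inf = float("inf")
    metrics = [0] + [inf] * (n_states - 1)      # survivor lengths
    survivors = [[] for _ in range(n_states)]   # survivor paths (input bits)
    for k in range(0, len(received), len(GENS)):
        z = received[k:k + len(GENS)]
        new_metrics = [inf] * n_states
        new_survivors = [None] * n_states
        for state in range(n_states):
            if metrics[state] == inf:
                continue                         # state not reachable yet
            for u in (0, 1):                     # extend survivor by one branch
                reg = (u << (K - 1)) | state
                nxt = reg >> 1
                branch = sum(e != r for e, r in zip(branch_output(reg), z))
                candidate = metrics[state] + branch
                if candidate < new_metrics[nxt]:  # keep the shortest extension
                    new_metrics[nxt] = candidate
                    new_survivors[nxt] = survivors[state] + [u]
        metrics, survivors = new_metrics, new_survivors
    best = min(range(n_states), key=metrics.__getitem__)
    return survivors[best]


if __name__ == "__main__":
    message = [1, 0, 1, 1, 0, 0]     # last two zeros flush the encoder
    coded = encode(message)
    coded[3] ^= 1                    # one channel error
    print(viterbi_decode(coded) == message)
```

Only one metric and one survivor per state are carried from step to step,
which is the property the memory organizations discussed later exploit.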

Certain modifications become necessary in practice. First, the input sequences
received are very long, or infinite, and the survivor length (decoder memory)
needs to be truncated to some manageable length d. Also, if k (the time
instant) becomes large, the metrics may grow indefinitely, and it becomes
necessary to renormalize the path metrics $\Gamma(x_k)$.
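
One common way to renormalize, shown here only as a sketch, is to subtract the
smallest path metric from all of them after each step. Whether the decoder in
this thesis subtracts the minimum or some other reference value is not stated
in this section, so the choice below is an assumption; the function name is
illustrative.

```python
def normalize(metrics):
    """Subtract the smallest path metric from every path metric."""
    smallest = min(metrics)
    return [m - smallest for m in metrics]


if __name__ == "__main__":
    metrics = [1000, 1003, 1001, 1007]
    print(normalize(metrics))
```

The add-compare-select decisions depend only on differences between metrics,
so subtracting a common value changes no decision while keeping the stored
values within a fixed word length.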

A closer look at the trellis reveals that the transitions in any unit of time
can be segregated into disjoint groups of four, each originating in a common
pair of states and terminating in another common pair. A typical such cell has
been shown in Fig. 3.2. The states at time k have been labeled x'0 and x'1,
and the states at time k+1 as 0x' and 1x' (x' concatenated with 0 or 1 denotes
the register contents at time k or k+1).



Figure 3.2 Cell of a shift register process trellis
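
The grouping into cells can be enumerated directly. In this sketch (the
function name and the integer encoding of states are assumptions for
illustration) a state is an n-bit integer, x' is its upper n-1 bits, and the
cell for x' connects predecessors x'0 and x'1 to successors 0x' and 1x'.

```python
def cells(state_bits):
    """List ((x'0, x'1), (0x', 1x')) for every value of x'."""
    half = 1 << (state_bits - 1)
    result = []
    for x in range(half):
        predecessors = (2 * x, 2 * x + 1)   # x' followed by 0 or 1
        successors = (x, x | half)          # 0 or 1 followed by x'
        result.append((predecessors, successors))
    return result


if __name__ == "__main__":
    for preds, succs in cells(3):
        print([format(p, "03b") for p in preds], "->",
              [format(s, "03b") for s in succs])
```

Every state appears in exactly one predecessor pair and exactly one successor
pair, so the cells are disjoint and can be processed independently.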

A butterfly extracted from the trellis diagram of Fig. 3.1 has been drawn in
Fig. 3.3.



Figure 3.3 A butterfly extracted from Fig. 3.1(b)

3.2 Implementation of a Viterbi Decoder


3.2.1 Memory organization for path metrics

Similar to in-place computation of the discrete Fourier transform, we can have
in-place computation of path metrics in a Viterbi decoder, which obviates the
need for a double buffer.

However, a price that is paid for this saving of memory is that we do not have
a straightforward addressing mechanism. Let us consider binary data fed to a
decoder with eight states. The decoder has eight hypotheses ending in 000=0,
001=1, 010=2, 011=3, 100=4, 101=5, 110=6, 111=7. The two successors (refer
Fig. 3.2) of 000 and 001 are 000 and 100. This means that we read metrics from
locations 0 and 1 and write them to locations 0 and 4. This is not in-place
computation.


Figure 3.4 Evolution of contents of path memory for in-place organization
(K=3) for three decoding cycles


To put the metrics in place we devise an addressing scheme that changes after
every decoding cycle. In general, at time instant i the metrics are found by
generating their natural addresses but rotating the bits of these addresses by
i places (refer Fig. 3.4) before reading or writing the metrics from or into
the memory. A cyclic shift of i places is identical to a cyclic shift of
i modulo K places, where K is the number of bits in a state address (K=3 in
the eight-state example above).

A possible scheme for address generation has been shown in Fig. 3.5.



Figure 3.5 Address generation for in-place computation
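
The rotating address scheme can be checked against an ordinary double-buffered
update. The following is a sketch and not the circuit of Fig. 3.5: the
direction of rotation (left), the function names, and the stand-in branch
metric `toy_cost` are assumptions chosen to match the state labeling of
Fig. 3.2, in which x'0 and x'1 lead to 0x' and 1x'.

```python
def rotl(value, shift, width):
    """Rotate a `width`-bit value left by `shift` places."""
    shift %= width
    mask = (1 << width) - 1
    return ((value << shift) | (value >> (width - shift))) & mask


def toy_cost(prev_state, next_state):
    """Stand-in branch metric (hypothetical); a decoder would use received symbols."""
    return (3 * prev_state + 5 * next_state + 1) % 7


def step_double_buffer(metrics, width):
    """Reference update: read the old buffer, write a separate new buffer."""
    half = 1 << (width - 1)
    new = [0] * (1 << width)
    for x in range(half):
        p0, p1 = 2 * x, 2 * x + 1             # predecessors x'0 and x'1
        for b in (0, 1):
            nxt = (b << (width - 1)) | x      # successors 0x' and 1x'
            new[nxt] = min(metrics[p0] + toy_cost(p0, nxt),
                           metrics[p1] + toy_cost(p1, nxt))
    return new


def step_in_place(memory, cycle, width):
    """Same update in one memory; at this cycle state s is stored at rotl(s, cycle)."""
    half = 1 << (width - 1)
    for x in range(half):
        p0, p1 = 2 * x, 2 * x + 1
        m0 = memory[rotl(p0, cycle, width)]   # read both metrics first
        m1 = memory[rotl(p1, cycle, width)]
        for b in (0, 1):
            nxt = (b << (width - 1)) | x
            # rotl(nxt, cycle + 1) is one of the two addresses just read,
            # so the results overwrite exactly the locations that were consumed.
            memory[rotl(nxt, cycle + 1, width)] = min(m0 + toy_cost(p0, nxt),
                                                      m1 + toy_cost(p1, nxt))


if __name__ == "__main__":
    width = 3
    reference = list(range(8))
    memory = list(reference)
    for cycle in range(6):
        reference = step_double_buffer(reference, width)
        step_in_place(memory, cycle, width)
        print(cycle, [memory[rotl(s, cycle + 1, width)] for s in range(8)] == reference)
```

For the example in the text, states 000 and 001 are read from locations 0 and
1, and their successors 000 and 100 are written back to locations 0 and 1,
because 100 rotated by one place is 001.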


3.2.2 Accelerating metric computation

To employ some parallelism in the metric computation based on binary data, one
can divide the path metric memory into two blocks: an even parity block and an
odd parity block [18, 19]. Thus, one metric will always be read from the even
parity block and the other metric from the odd parity block. The even parity
and odd parity blocks may be subdivided further to accelerate metric
computation still more. In general, for the M-ary case, each of the M possible
modulo-M digit sums occurs for exactly one of the M addresses used together.
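
The parity argument is easy to verify numerically. This sketch assumes the
binary case with states encoded as integers, as in Fig. 3.2; the function
names are illustrative only.

```python
def parity(value):
    """Return 0 for an even number of one bits in `value`, 1 for an odd number."""
    return bin(value).count("1") & 1


def bank_pairs(state_bits):
    """For every cell, the memory banks of the two metrics read and the two written."""
    half = 1 << (state_bits - 1)
    pairs = []
    for x in range(half):
        read = (parity(2 * x), parity(2 * x + 1))   # states x'0 and x'1
        written = (parity(x), parity(x | half))     # states 0x' and 1x'
        pairs.append((read, written))
    return pairs


if __name__ == "__main__":
    print(all(sorted(r) == [0, 1] and sorted(w) == [0, 1]
              for r, w in bank_pairs(8)))
```

The two addresses used together differ in exactly one bit, so they always fall
in different banks and can be accessed in the same cycle. A cyclic rotation of
an address does not change its number of ones, so the same split remains valid
when the rotating in-place addressing of section 3.2.1 is used.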

3.2.3 Memory organization for survivor paths

Path memory processing consists of storing in path memories the d most recent
decisions of the state processors and providing a mechanism for deriving the
most probable transmitted information sequence from these decisions. Two
methods are commonly used for determining the transmitted sequence: the
traceback method and the register exchange method. Most high-speed, very large
scale integration application-specific integrated circuit (VLSI ASIC)
implementations of the Viterbi decoder employ one of these.

Methods of reading the path memories 
• Traceback 


This requires that the decisions calculated for each symbol are saved
in a path memory represented as a two-dimensional array, with the symbol period
identifying a column of decisions and the state identifying a row of sequential decisions.
The traceback is started from an arbitrary state. The path is then extended back
through the trellis by recursively determining the sequence of the states that form
the path. It is necessary to trace back the decision depth d to ensure that all paths have
merged with the correct one. A path memory of depth 4 or 5 times the constraint
length is sufficient for negligible degradation from the optimum decoder [15].

Traceback involves reading columns of decisions and selecting one
bit from each column, indexed on the current state, to update the current state and
continue the traceback. Higher speeds are met by implementing 2d tracebacks,
generating d valid decoded bits, while a further d bits are received at the decoder
and processed by the ACS block.
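The traceback procedure can be illustrated on a toy code. The sketch below is not the decoder of this thesis: it assumes the textbook rate 1/2, K = 3 code with generator taps 7 and 5 (octal), hard decisions, and a state register in which the newest bit occupies the MSB. All function names are hypothetical.

```python
G = (0b111, 0b101)        # generator taps of the toy (7, 5) code, an assumption
K = 3                     # constraint length of the toy code
N = K - 1                 # number of state bits
NUM_STATES = 1 << N


def parity(x):
    return bin(x).count("1") & 1


def branch_output(prev_state, bit):
    reg = (bit << N) | prev_state            # newest bit sits in the MSB
    return tuple(parity(reg & g) for g in G)


def encode(bits):
    state, out = 0, []
    for b in bits:
        out.append(branch_output(state, b))
        state = (state >> 1) | (b << (N - 1))
    return out


def acs_forward(received):
    """Add-compare-select over all symbols; returns final metrics and decision columns."""
    inf = float("inf")
    metrics = [0] + [inf] * (NUM_STATES - 1)   # the encoder starts in state 0
    columns = []
    for r in received:
        new_metrics, column = [], []
        for t in range(NUM_STATES):
            bit = t >> (N - 1)                 # input bit that leads into state t
            best, best_d = inf, 0
            for d in (0, 1):                   # d is the LSB of the predecessor
                p = ((t << 1) & (NUM_STATES - 1)) | d
                exp = branch_output(p, bit)
                cand = metrics[p] + (exp[0] ^ r[0]) + (exp[1] ^ r[1])
                if cand < best:
                    best, best_d = cand, d
            new_metrics.append(best)
            column.append(best_d)
        metrics = new_metrics
        columns.append(column)                 # one column of decisions per symbol
    return metrics, columns


def traceback(columns, start_state):
    """Read one decision per column, indexed by the current state, newest column first."""
    state, bits = start_state, []
    for column in reversed(columns):
        bits.append(state >> (N - 1))          # decoded bit is the MSB of the state
        state = ((state << 1) & (NUM_STATES - 1)) | column[state]
    bits.reverse()
    return bits
```

In this sketch the message is terminated with K - 1 zeros so that state 0 is a safe starting point. A hardware decoder that starts from an arbitrary state would instead discard the bits produced during the first d steps of each traceback.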

• Register Exchange 

The register exchange method is based on the movement of hypothesised
information sequences within the path memory. This is facilitated by implementation
of the path memory as a set of trellis-connected shift registers. Rather than storing
in memory the column of decisions calculated for each symbol and then determining
the path by subsequent access to the memory, as in the traceback method, the
register exchange method moves the hypothesised information sequences, or paths,
from state to state as each received symbol is processed. The new decision calculated
by the ACS processor for each state is added to the beginning of the path currently
associated with each state. The merging of two paths at a particular state manifests
itself as a duplication of the path up to that point in two different registers. Just as
with the traceback technique, a path of length equal to the decision depth must be
determined before valid decoded bits are obtained.
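Both methods consume the same ACS decisions and differ only in when the path is assembled. The following sketch makes that concrete for a toy 4-state trellis; the names are hypothetical, the decision columns are random stand-ins rather than real ACS output, and the newest bit is assumed to sit in the MSB of the state.

```python
import random

N = 2                      # state bits of a toy 4-state trellis (assumption)
NUM_STATES = 1 << N


def predecessor(state, decision):
    """Survivor predecessor of state; decision is the LSB of that predecessor."""
    return ((state << 1) & (NUM_STATES - 1)) | decision


def register_exchange(columns):
    """Keep one register per state holding the whole hypothesised sequence."""
    registers = [[] for _ in range(NUM_STATES)]
    for column in columns:
        # Each state copies the register of its survivor and appends the
        # information bit of the branch just taken (the MSB of the new state).
        registers = [registers[predecessor(t, column[t])] + [t >> (N - 1)]
                     for t in range(NUM_STATES)]
    return registers


def traceback(columns, start_state):
    """Rebuild the same sequence afterwards from the stored decision columns."""
    state, bits = start_state, []
    for column in reversed(columns):
        bits.append(state >> (N - 1))
        state = predecessor(state, column[state])
    return bits[::-1]
```

For any set of decisions the register belonging to a state holds exactly the sequence that a traceback started from that state would produce, which is why the choice between the two is a matter of wiring, storage and latency rather than of decoding performance.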

Traceback versus register exchange 

With 2d tracebacks generating d valid decoded bits, traceback requires a total of
$3d\,2^{K-1}$ bits of storage and has a latency of 2d. Speed advantages higher than 2:1
can be used to reduce storage and latency. A major cost of the traceback technique is
the considerable additional circuitry needed for RAM control, decision bit selection
and output buffering. The RAM is operated as a circular buffer and the output bit
buffer is essentially a FIFO. The additional circuitry consists largely of random logic.

Implementation of register exchange requires the theoretical minimum
latency, d symbol periods, and storage of $d\,2^{K-1}$ bits for a code with $2^{K-1}$ states. The
register size is equal to the decision depth d, with one register being used for each
state.
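As a rough numerical illustration of these storage figures, the sketch below evaluates both expressions. It assumes the storage requirements are 3d * 2^(K-1) bits for traceback and d * 2^(K-1) bits for register exchange, which is our reading of the expressions above; the function names are hypothetical.

```python
def traceback_storage_bits(d, k):
    """Decision memory for 2d tracebacks yielding d bits: 3d columns of 2**(k-1) bits."""
    return 3 * d * 2 ** (k - 1)


def register_exchange_storage_bits(d, k):
    """One d-bit register for each of the 2**(k-1) states."""
    return d * 2 ** (k - 1)
```

For example, with K = 9 and d = 64 these expressions give 49152 bits for traceback and 16384 bits for register exchange, before counting the trellis wiring and multiplexers that register exchange needs.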

The exact arrangement of the pipelined register exchange depends on the
desired speed of operation. In addition to trellis wiring and register cells, the
register exchange process requires multiplexers to select surviving paths. To achieve
the highest possible speed in the path memory, an instance of trellis wiring and
multiplexers must be placed between each column of decision bits. At reduced
speeds several bits of register may be placed between each multiplexer and trellis
connection.

Register exchange offers the potential for very high decoded bit rates, while
traceback is competitive for all other bit rates and is more closely linked to the
current automated layout tools and memory technology that dominate the ASIC
arena. Implementation of the register exchange method becomes viable for longer
constraint length codes only through the use of state labelling.

3.2.4 State metric representation and normalization

Normally 4 or 5 bits are used for state metric representation. To prevent
unbounded growth of the state metric, it becomes necessary to renormalize the state
metric from time to time.

In the Viterbi algorithm only the relative magnitude of the state metrics
is important. As a result an offset, positive or negative, can be added to all the
state metrics in any symbol period. One of the techniques has been to reduce all state
metrics by a certain amount in case they all exceed a certain threshold value. For
example, the MSB of every metric register is monitored, and if they are all one,
normalization occurs by removing the MSB of all metric registers. One of the novel
techniques suggested in [17] has been to use two's complement arithmetic as an alternative
to the normalization method. The registers are designed to store metrics of length
greater than $2D_{\max}$, where $D_{\max}$ is the maximum possible difference between state
metrics. By using two's complement arithmetic the overflows can be accommodated.
Though it is computationally faster and gives an increased throughput, the approach
has been avoided for reasons which will be made evident in the next section.
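The two techniques can be contrasted in a few lines. The sketch below is illustrative only; the function names and the 4-bit register width used in the example are assumptions.

```python
def normalize_msb(metrics, width):
    """If every metric has its MSB set, clear it everywhere (subtract 2**(width-1))."""
    msb = 1 << (width - 1)
    if all(m & msb for m in metrics):
        return [m & (msb - 1) for m in metrics], True
    return list(metrics), False


def modular_less(a, b, width):
    """Compare metrics stored modulo 2**width without any normalization.

    Valid when the true difference is smaller than 2**(width-1): the sign
    bit of the wrapped difference a - b then tells whether a is smaller.
    """
    diff = (a - b) & ((1 << width) - 1)
    return diff >= (1 << (width - 1))
```

With 4-bit registers the metrics 9, 12 and 15 all have the MSB set and become 1, 4 and 7, which leaves every difference unchanged. In the modular scheme true metrics 14 and 17 are stored as 14 and 1, and the wrapped difference still identifies 14 as the smaller one. Only the first scheme produces a normalization event that can be counted.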

3.2.5 Bit synchronization and bit error rate (BER) measurement

A Viterbi decoder may have a synchronizing unit [20] for determining the beginning
of the received symbol stream and, if required, for resolving the symbol polarity*. A
rate 1/2 convolutional encoder encodes each binary bit into two symbols, and the output
stream can be grouped into packets of two symbols as shown:

$\ldots,\; S_{k-1}S_k,\; S_{k+1}S_{k+2},\; S_{k+3}\ldots$

or

$\ldots S_{k-1},\; S_kS_{k+1},\; S_{k+2}S_{k+3},\; \ldots$

The synchronizer resolves between the two data streams.

A synchronizing unit relies on an input error indicator for synchronizing the
input data stream. The input sequence errors can be indicated either by monitoring
the rate at which the path metrics are increasing or by using the merging properties
of the trellis. A decoder out of sync tends to have surviving paths which merge
more slowly than those of a decoder in sync. Any such condition, when detected, is fed to the
synchronizing unit for correction.

* where non-transparent codes are used

One possible circuit for monitoring the input sequence error rate is shown
in Fig. 3.6. Here the bit synchronization is a two-step process. A counter, initialized
externally by the microprocessor, monitors the normalization rate. The counter is
clocked every time the path metrics are normalized. The overflow output of this
counter is used to indicate an out-of-sync condition. The external count fed to this
counter sets an in-sync/out-of-sync threshold for the Viterbi decoder. Another
counter, also initialized externally, is clocked by the decoded data output clock. The
overflow output of this counter is used to indicate an in-sync condition.
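The behaviour of the two counters can be modelled as follows. This is a behavioural sketch, not the circuit of Fig. 3.6: the class and method names are hypothetical, and it assumes that both counters are reloaded whenever either of them overflows, a detail the description above does not state.

```python
class NormalizationRateMonitor:
    """Two-counter model: whichever counter overflows first decides the sync status."""

    def __init__(self, norm_limit, bit_limit):
        self.norm_limit = norm_limit    # normalizations tolerated per window
        self.bit_limit = bit_limit      # decoded bits that make up one window
        self._reload()

    def _reload(self):
        # Assumption: both counters restart together after any overflow.
        self.norm_count = 0
        self.bit_count = 0

    def on_normalization(self):
        """Clocked whenever the path metrics are normalized."""
        self.norm_count += 1
        if self.norm_count >= self.norm_limit:
            self._reload()
            return "out_of_sync"
        return None

    def on_decoded_bit(self):
        """Clocked by the decoded data output clock."""
        self.bit_count += 1
        if self.bit_count >= self.bit_limit:
            self._reload()
            return "in_sync"
        return None
```

A decoder that is out of sync accumulates metric quickly, so the normalization counter overflows before the bit counter does. The two limits play the role of the counts loaded by the microprocessor.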



Figure 3.6: Normalization rate monitor circuit

The second step of the automatic synchronization process attempts to
correct an indicated out-of-sync condition by offsetting the data input to the decoder
prior to the actual decoding process.

Two's complement arithmetic destroys the reliability indicators, viz.
normalization rate or trellis merging information, that can be used in the synchronization
loop for synchronizing the input sequence stream, and hence is not suited for a Viterbi
decoder where the bit synchronization unit forms an integral part of the design.



Figure 3.7: Re-encode and compare circuit for channel BER estimation

By introducing a delay unit and a channel encoder which re-encodes the
decoded data stream from the decoder, a comparison unit gives a measure of channel
errors by making a bit-by-bit comparison of the re-encoded data stream and the
delayed input data stream. A possible configuration for measuring channel bit error
rate (BER) is shown in Fig. 3.7.
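The re-encode and compare idea can be sketched in a few lines. The example assumes a toy rate 1/2, K = 3 encoder with taps 7 and 5 (octal) rather than the code used in this thesis, hard-decision symbols, and list alignment in place of the hardware delay unit; the function names are hypothetical.

```python
def parity(x):
    return bin(x).count("1") & 1


def encode(bits, taps=(0b111, 0b101), k=3):
    """Rate 1/2 convolutional encoder; returns a flat list of channel symbols."""
    state, out = 0, []
    for b in bits:
        reg = (b << (k - 1)) | state      # newest bit in the MSB
        out.extend(parity(reg & g) for g in taps)
        state = reg >> 1
    return out


def estimate_channel_ber(received_hard, decoded_bits):
    """Re-encode the decoder output and count disagreements with the received symbols."""
    reencoded = encode(decoded_bits)
    errors = sum(a != b for a, b in zip(reencoded, received_hard))
    return errors / len(reencoded)
```

The estimate is only as good as the decoded stream: whenever the decoder itself makes an error, the re-encoded reference is wrong and the count is biased.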

This completes our discussion of the Viterbi decoding algorithm. In chapter
4 we will compare the various techniques involved in the implementation of the
Viterbi decoding algorithm.

Chapter 4


Area Efficient Viterbi Decoder — 
Design Issues 


In this chapter we will compare some of the approaches taken in the implementation
of Viterbi decoders. However, our discussion will be confined to designs that
have been fabricated and tested. Wherever possible we will discuss the motivation
behind a particular approach.


4.1 A Comparison of Different VLSI Viterbi Decoder Designs

4.1.1 Introduction

Broadly speaking, the different realizations of Viterbi decoders may be divided into
two categories, viz. node-parallel and node-serial [21, 22, 23]. In the node-parallel
approach, computations for all nodes in the trellis are done simultaneously. In the
node-serial approach, computations are done sequentially, node by node, and the
decoding operation is therefore much slower. There have been several implementations
attempting a tradeoff between area of implementation and the number of nodes. A
number of ACS units, say P, where P is less than the number of nodes N, are shared
by the nodes in carrying out the branch metric computation. One attractive feature
of this method is that it allows a linear tradeoff between area and speed [24].

For highest speed decoding, a fully parallel processing implementation with
one processor per decoder state is used. These processors, or ACS units, decide which
of the possible information sequences entering a state is more likely to survive. The
processors need to be interconnected according to the trellis diagram of the code;
this gives rise to a problem of a different kind, where reduction of the wiring area
on a silicon chip is the main aim. This will be discussed in the next section of
the chapter. The code memory depends on constraint length K as $2^{K-1}$, and this
exponential dependence on code memory length forces a tradeoff between speed and
constraint length of the convolutional code that can be implemented on a
single chip. To reduce global wiring, which may occupy 37% of the area of the chip, several
designers have proposed locally connected processor architectures that completely
eliminate global interconnects [29] and also reduce the number of ACS elements by
processing more than one state per processor [30]. For increasing decoder throughput,
multiple decoders have been run on either interleaved data or blocked data.

Another aspect that has been investigated is the design of the
path memory processing. There are two methods in common use: register exchange
and the traceback method. The register exchange method has been used in a high-speed
ASIC design for the 25 Mbit/s NASA (2,1,7) decoder, and traceback has been used in the
17 Mbit/s decoder developed by Qualcomm [21]. There is still disagreement as to
which of the two methods gives the most efficient VLSI implementation.

4.1.2 Algorithms and the architectures

The VLSI grid model can be used to account for the chip area. In the VLSI
grid model a VLSI chip may be viewed as a computation graph whose vertices
are called nodes and whose edges are called wires. Nodes can be considered as processors
responsible for the computation of Boolean functions. The wires are simply electrical
connections responsible for transfer of information between the nodes and also for
distribution of power supply and timing information to the nodes. The area of the
circuit is defined as the area, in unit squares, of the smallest rectangle enclosing the
circuit. Two schemes, based on the shuffle exchange (SE) and cube connected cycles
(CCC), are identified for the placement and routing of the required ACS processing
elements. Discussion and applications of shuffle exchange organizations can be found
in [25] and [26].

The shuffle exchange graph consists of $2^{K-1}$ nodes, each of which corresponds
to one state of the single-stage trellis. Each node, labeled 0 to $2^{K-1}-1$, is
associated with an equivalent $(K-1)$-bit binary string $s_{K-2} \ldots s_1 s_0$. Two nodes $k$
and $k'$ are linked via a shuffle edge or an exchange edge. An exchange edge links
two nodes $k$ and $k'$ if $k$ and $k'$ differ only in the last bit, i.e. $k = s_{K-2} \ldots s_1 s_0$
and $k' = s_{K-2} \ldots s_1 \bar{s}_0$. A shuffle edge links two nodes $k$ and $k'$ if
$k = s_{K-2} s_{K-3} \ldots s_0$ and $k' = s_{K-3} \ldots s_0 s_{K-2}$. The shuffle exchange graph can be
related to a single-stage trellis. Assuming the number of states is $2^{\nu}$, the states may
be labeled from 0 to $2^{\nu}-1$. Each initial state, identified by an equivalent
$\nu$-bit binary label $x_{\nu-1} x_{\nu-2} \ldots x_1 x_0$, maps into two states $x_{\nu-2} \ldots x_0 x_{\nu-1}$ and
$x_{\nu-2} \ldots x_0 \bar{x}_{\nu-1}$. The edge from an initial state to the two final states can be
considered as a shuffle operation mapping state $x_{\nu-1} \ldots x_1 x_0$ to $x_{\nu-2} \ldots x_0 x_{\nu-1}$,
or a shuffle operation followed by an exchange operation mapping the initial state
to state $x_{\nu-2} \ldots x_0 \bar{x}_{\nu-1}$. Expressed in terms of Shuffle and Exchange, the input
node $k$ is connected to nodes $k'$ and $k''$ where

$k' = \mathrm{Shuffle}(k)$

and

$k'' = \mathrm{Exchange}(\mathrm{Shuffle}(k))$

This interconnection structure can be implemented on a Shuffle Exchange network.
A cycle of shuffle edges in a Shuffle Exchange graph is known as a necklace. The
architecture is cleaved into cycles of shuffle edges to generate a layout for embedding
the Viterbi algorithm in silicon.

Figure 4.1: (a) Single-stage trellis with recirculation edges, (b) Graph unfolded, (c)
The Add-Compare-Select nodes in (a) separated, (d) Grid model layout for the
four-node 2-SE graph


Fig. 4.1 shows the grid model layout for a single-stage trellis with recirculation
edges. In Fig. 4.1(b) the nodes have been arranged into a planar graph. In Fig.
4.1(c) only those nodes which serve as wiring distribution points have been split and
stretched. Fig. 4.1(d) shows a 2-SE* graph that has been generated for embedding the
Viterbi algorithm. The necklace has been drawn as a dashed rectangle consisting
of two long vertical segments and two unit-length horizontal segments. The two
exchange edges have been shown as solid horizontal lines.
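The Shuffle and Exchange mappings, and the necklaces formed by the shuffle edges, can be written down directly. The sketch below is illustrative: the function names are hypothetical, Shuffle is taken as a cyclic left shift of the state label and Exchange as complementing its last bit.

```python
def shuffle(k, nu):
    """Cyclic left shift of the nu-bit label k."""
    mask = (1 << nu) - 1
    return ((k << 1) | (k >> (nu - 1))) & mask


def exchange(k):
    """Complement the last bit of the label."""
    return k ^ 1


def trellis_successors(k, nu):
    """The two states reached from k: Shuffle(k) and Exchange(Shuffle(k))."""
    s = shuffle(k, nu)
    return s, exchange(s)


def necklaces(nu):
    """Partition the 2**nu nodes into cycles of shuffle edges."""
    seen, cycles = set(), []
    for start in range(1 << nu):
        if start in seen:
            continue
        cycle, k = [], start
        while k not in seen:
            seen.add(k)
            cycle.append(k)
            k = shuffle(k, nu)
        cycles.append(cycle)
    return cycles
```

For the four-node graph this gives the necklaces {0}, {1, 2} and {3}. For eight nodes it gives {0}, {1, 2, 4}, {3, 6, 5} and {7}.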

The second architectural strategy proposed is based on the cube connected
cycles (CCC). However, CCC results in more interconnect wire area than the SE
architecture.

It has been observed in [27] and in [28] that applying the optimal shuffle
exchange layout proposed in [29] yields layouts with poor utilization of silicon area,
as such layouts cannot be compacted. This is a result of constraints imposed
by abutting nodes and the distribution of global signals (buses, clock signals and
power supply), which are irregular and hence take greater silicon area.

* Since the number of nodes is a power of two, the graph is referred to as a 2-SE
(Shuffle-Exchange) graph.



Figure 4.2: Steps leading to ring topology: (a), (b) organization in pairs, (c)
identification of Hamilton cycles, here (0,1,3,2,0), and (d) the resulting two-column floor
plan

In [28], it has been suggested that the ACS elements be arranged to form
one directed cycle, in which one of the two outgoing path metric buses of an ACS
element connects directly to the neighbor. The ring has been organized as two
columns as shown in Fig. 4.2. Based on the concept of a ring organization, a Hamilton
cycle* is found for the de Bruijn graph.

The two-ring topology has the advantage of reducing the routing of the buses
to a channel routing problem. Further, the regularity of the structure allows easy
vertical routing of the global signals. Determining Hamilton paths is an NP-complete
problem. The application of this concept is therefore limited to 32 ACS units.
Beyond that value the computational complexity prevents an exhaustive search of
all the Hamilton cycles.
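For small trellises the Hamilton cycles can be enumerated by a plain depth-first search. The sketch below is an illustration, not the procedure of [28]: the function names are hypothetical, and state k is assumed to be connected to states 2k mod n and (2k + 1) mod n.

```python
def successors(k, n):
    """Trellis (de Bruijn graph) successors of state k among n states."""
    return ((2 * k) % n, (2 * k + 1) % n)


def hamilton_cycles(n):
    """Exhaustively list the Hamilton cycles that start and end at state 0."""
    cycles = []

    def extend(path, visited):
        last = path[-1]
        if len(path) == n:
            if 0 in successors(last, n):     # the ring must close back on state 0
                cycles.append(list(path))
            return
        for nxt in successors(last, n):
            if nxt not in visited:
                visited.add(nxt)
                path.append(nxt)
                extend(path, visited)
                path.pop()
                visited.remove(nxt)

    extend([0], {0})
    return cycles
```

For four states the search returns the single cycle 0, 1, 3, 2, which is the one shown in Fig. 4.2. The number of cycles grows quickly with the number of states (2 for eight states, 16 for sixteen), which illustrates why exhaustive search becomes impractical for large decoders.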

The throughput of a Viterbi decoder can be increased by applying lookahead
to both ACS and traceback recursions [31]. A $2^{K-1}$-state trellis can be
iterated from time index $n-k$ to $n$ by decomposing the trellis into $2^{K-1-k}$
subtrellises, each consisting of $k$ iterations of a $2^k$-state trellis. Each $2^k$-state
subtrellis can be collapsed into an equivalent one-stage radix-$2^k$ trellis by applying
$k$ levels of lookahead to the recursive update.



Figure 4.3: (a) 8-state radix-2 trellis, (b) 4-state subtrellis decomposition, (c) 8-state
radix-4 trellis

An example of the decomposition of an eight-state radix-2 trellis into an
equivalent radix-4 trellis using one stage of lookahead is shown in Fig. 4.3.

* A Hamilton cycle is a path which goes through all of the vertices once and only once [32].

If the delays through a $2^k$-way and a 2-way ACS are equal, a possible $k$-fold speedup
is achievable for a complexity increase of $2^{k-1}$ in the ideal case. However, the actual
speedup obtained has not been mentioned in the original paper [31]. It has been
observed that the radix-4 case corresponds to the best tradeoff between area complexity
and speedup. This result is technology dependent, and technologies better than
1.2 μm may see higher radix lookahead implementations.
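The equivalence between two radix-2 iterations and one radix-4 iteration can be checked numerically. The sketch below uses an 8-state trellis with randomly chosen branch metrics; it illustrates the lookahead idea only and is not the architecture of [31]. All names are hypothetical.

```python
import random

NU = 3                      # state bits of the 8-state example trellis
S = 1 << NU


def preds(t):
    """The two radix-2 predecessors of state t (state p leads to 2p and 2p+1 mod S)."""
    return [(t >> 1) | (d << (NU - 1)) for d in (0, 1)]


def radix2_step(metrics, bm):
    """One ordinary ACS iteration; bm[(p, t)] is the branch metric of edge p -> t."""
    return [min(metrics[p] + bm[(p, t)] for p in preds(t)) for t in range(S)]


def radix4_step(metrics, bm1, bm2):
    """Two trellis stages collapsed into a single 4-way add-compare-select."""
    new = []
    for t in range(S):
        # Lookahead: the branch metrics of both stages are summed before comparing.
        candidates = [metrics[q] + bm1[(q, mid)] + bm2[(mid, t)]
                      for mid in preds(t) for q in preds(mid)]
        new.append(min(candidates))
    return new


def random_branch_metrics(rng):
    return {(p, t): rng.randint(0, 14) for t in range(S) for p in preds(t)}
```

Each state now compares four candidates instead of two but is updated only once for every two received symbols, which is the source of the speedup when a 4-way ACS is about as fast as a 2-way one.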

In addition to the totally serial and totally parallel approaches there have been
attempts at mixed designs. A bit-serial architecture with parallel nodes has
been given in [33]. The node-parallel architecture carries out computations in a
bit-serial manner. The reason for using the bit-serial manner has been to reduce wiring
requirements and to allow interconnection of chips for larger constraint lengths.

4.1.3 Path memory processing

As mentioned, the two common methods of path memory processing, viz. the register
exchange and traceback methods, have been used in the design of high-speed decoders.
Latency constraints in the traceback method have been reduced in hardware by dividing
the symbol time into multiple reads of the past decisions and one write of the current
decisions. In a hardware implementation, generally, 2d tracebacks are performed,
generating d valid decoded bits, while a further d bits are received at the decoder
and processed by the ACS unit. Register exchange, because of its very nature,
introduces the cost of trellis wiring. However, it is possible to reduce trellis wiring
by relabelling the states within the region. The total reduction of wiring area using
this technique can be as high as 66% [27].

4.1.4 ACS arithmetic and normalization

The number of bits used to implement the ACS arithmetic depends on the
maximum path metric range. The intricate process of normalization can be avoided
by using two's complement format and ignoring the overflow. This approach has
been used in [31], where the number of bits required to implement the state metric has
been obtained as

$\Gamma_{\mathrm{bits}} = \lceil \log_2(\Delta_{\max} + \lambda_{\max}) \rceil + 1$

where $\Delta_{\max} \le \lambda_{\max} \log_2 N$ is the bound on the maximum dynamic range of state metrics,
$N$ is the number of states and $\lambda_{\max}$ is the maximum branch metric for the radix-2
trellis. As has been discussed in chapter 3, two's complement arithmetic leads to
the loss of indicators used for bit synchronization, and invariably all commercial
decoders use the normalization rate to automatically synchronize the input data.

In [34] the most negative state metrics are preserved; however, to prevent
the state metrics from going too negative, a positive offset is added to all the metrics
to preserve their relative magnitude. To prevent positive overflow, large positive
state metrics are hard limited. To avoid information loss, which can result if all
state metrics are hard limited, a negative offset is added to all the state metrics if
they all move away from the negative end.

In [28] a dynamic solution to the state metric normalization is given.
Path metrics have been represented by 7 bits and branch metrics by 4 bits. A
global precharged wire exists that any ACS can discharge if its path metric is less
than 32. The signal is input to a register in each ACS element, and this register
drives the three most significant bits of the branch metric operand to the adder.
Rescaling is initiated if the minimum path metric exceeds 32.

This concludes the discussion on the various techniques that have been used
for fabricating Viterbi decoders as VLSI chips. In the next chapter we discuss our
implementation of a 19.2 kbps, 256-state Viterbi decoder. We will also discuss
the suitability of Field Programmable Gate Arrays (FPGAs) for obtaining a rapid
prototype of a Viterbi decoder.

Chapter 5


Design of 19.2 Kbps, 256 State,
Area Efficient Viterbi Decoder 
Using High-Level Synthesis 


5.1 Introduction



Figure 5.1: Viterbi decoder block diagram

A major contribution of this thesis has been in the design and implementation
of a Viterbi decoder. A Viterbi decoder is a very important block in any
CDMA modem. The aim was to design a 19.2 kbps, 256-state Viterbi decoder with the
added capability of catering to higher input data rates. In the literature, the discussions
have been primarily focussed on the analysis of a proposed architecture
rather than on its implementation; this is probably due to the proprietary nature
of the final implementation. To the best of our knowledge none of the existing
literature discusses Viterbi decoding implementation based on high-level synthesis
targeted for Field Programmable Gate Arrays (FPGAs). This has been the focus
in the present thesis.

Besides the above, the issues of the organization of a path memory, decision
memory, the decision memory reading techniques and the clocking mechanism
have been left unanswered. Some of the implementations have used two's complement
representation for the path metrics. However, this is achieved at the cost of
destroying important information used for input bit sequence synchronization.
The aim was to retain this bit synchronization information, even though it made
the normalization of the path metrics essential. Normalization of the path metrics
is a difficult operation, and to implement this feature there are two options: either
to use a special time slot for this operation or else to use very fast subtracters. As
reserving a time slot slows the decoding operation, and hence the decoding speed,
the latter option was chosen.

In this thesis we explore an implementation methodology for rapid prototyping.
We chose FPGAs because of their reprogrammability and short programming
times. With FPGAs it is possible to immediately assess the impact of various
design changes on the final architecture. However, designing with coarse-grained
FPGAs such as Xilinx FPGAs, which we have used in our implementation, is totally
different from standard cell based or mask programmable ASIC designs.

In this thesis we highlight the possibility of efficiently implementing in FPGAs
very complex digital systems with stringent design constraints, such as a Viterbi
decoder. The implementations are efficient both in terms of the number of logic
blocks used and also in terms of the timing constraints which need to be satisfied.

In designing the Viterbi decoder we have used some of the features discussed
in the literature; however, we have attempted to compare their merits with
other existing techniques and also the advantages that have accrued because of their
use. It has not been our attempt to give just another implementation of the Viterbi
algorithm, but to create design guidelines which may be used in any future
implementation of the decoder. The methodology that we have tried to develop does not
limit its usefulness to only the Viterbi algorithm but can be used for the rapid
prototyping of any Digital Signal Processing algorithm.

We have used Verilog, a high-level hardware description language, for the
RTL description of our Viterbi decoder. We found that writing synthesisable
code was totally different from writing the simulation model needed for validating
and verifying the intended architecture. We found that synthesis tools support
only a subset of the original language features. This can necessitate a total rewriting of
the earlier RTL description. We have also highlighted a few techniques that we have
developed to circumvent some of the limitations of the synthesis tool Pinnauq. It
has been our experience that synthesis tools which support RTL description perform
better than synthesis tools which support behavioral descriptions.

In this chapter we will discuss a design flow by which we can closely relate
the explored architecture and the VLSI implementation. Central to this methodology
is the synthesis of the logic (the VLSI implementation) from the code written
in Verilog (abstraction of the architecture). The synthesis is performed in an
exploratory manner, in order to arrive at an architecture solution with improved
performance in terms of area and delay constraints.

One of the advantages of writing the RTL description in Verilog was the in-built
random number generators, which return integer values distributed according to
various probabilistic distributions, e.g. PoissonDistribution(), UniformDistribution()
[35]. These functions were used not only to generate the input data stream
while testing the proposed architecture, but also to simulate the noise bursts that
affect real-time data transmission. These functions return the same value given
the same seed, and this feature facilitates debugging of the design by making the
system inputs repeatable.

In the decoder the 8-level quantized data is decoded
optimally by the Viterbi algorithm. Two ACS pairs have been used to process the
data serially. Our decision for two pairs was motivated by a desire to reduce the
total number of logic gates used. However, using a single ACS unit was found to
increase the interconnections, and the benefit of the butterfly structure of the
Viterbi algorithm was also lost. Decisions from the ACS units are stored in a partitioned
internal path memory. A traceback through the path memory of depth 64 outputs
a single data bit. The path memory has been partitioned into even parity and
odd parity memories. The main drawback of this configuration is the complicated
memory decoding structure. However, this is more than offset by the reduction of
global interconnection resources and the increase in resource utilization. Fig.
5.1 shows the top-level block diagram of the Viterbi decoder. The implementation
of the various blocks is discussed in the following sections.

5.2 Design Methodology

The algorithms have been specified in Verilog, a hardware description language. The
Verilog code gives the register transfer level (RTL) description of the architecture.
Data generators and pseudo-noise generators written in Verilog are used to specify
the inputs during the simulation of the architecture. The inbuilt functions of Verilog
are used to check the coding gain and hence the suitability of the architecture.
A high-level description allows us to verify the suitability of various approaches, e.g. in
computation of the branch metric, the address for starting the traceback, the size of
the traceback memory, etc. Implementation into a hard-wired architecture is done in
an exploratory fashion. This is done by a juxtaposed simulation of the DSP system
specification and the RTL representation.

Fig. 5.2 shows the proposed methodology used for the synthesis and analysis
of the Viterbi decoder circuit. To create an FPGA implementation, the initial Verilog
file was modified to be acceptable to our Verilog-FPGA interfacing tool Pinnauq.
The Verilog code is compiled together with the Xilinx technology libraries and the
result is saved in a Xilinx netlist format (XNF) file. XACT, the Xilinx
development system, is then run on the XNF file to create a configuration file for the
FPGA. The performance of the synthesised design is obtained, and this information
is then used to hand-tune the Verilog RTL code of the Viterbi decoder so as to effect
the desired change through Pinnauq.



Figure 5.2: Proposed FPGA design flow


5.3 Decoder Implementation


5.3.1 Specifications

The specifications chosen for the Viterbi decoder to be implemented are as follows:

1. K (constraint length) = 9 (256 states), R = 1/2 convolutional decoder

2. Generator polynomials 561 and 743, γ (coding gain) = 4.77

3. 8-level soft decision inputs

4. Survivor path length 64

In addition to these specifications the decoder should have the capability of
self-synchronization in case the synchronization is lost. A microprocessor initializes
the decoder; it also sets a threshold difference between the largest and the smallest
path metric. A self-synchronizing circuit takes a corrective decision if the difference
between the two metrics does not exceed the user-defined threshold. Moreover, the
microprocessor can read the rate of growth of the metrics before initiating any
corrective measures.


5.3.2 Branch metric unit

Figure 5.3: Block diagram of a branch metric unit


The two 3-bit soft decisions are combined to form four possible 4-bit branch
metrics, i.e. λ(00), λ(01), λ(10) and λ(11). The branch metrics are generated as
the Hamming distances of the input data stream from (00), (01), (10) and (11); for
example, the Hamming distance of a bit stream '101000' from symbol '10'* is 2 and from
the symbol '01' is 12. The maximum value the Hamming distance can take is 14,
and that explains using four bits for the branch metric. The block diagram of the
branch metric unit is shown in Fig. 5.3.
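The numbers in the example can be reproduced by summing the absolute differences between the two received 3-bit levels and the ideal levels 0 (000) and 7 (111). The sketch below is our reading of the example rather than a description of the circuit in Fig. 5.3; the function name is hypothetical.

```python
def branch_metrics(r0, r1):
    """Distances of a pair of 3-bit soft decisions from the four ideal symbol pairs.

    Bit 0 corresponds to level 0 (000) and bit 1 to level 7 (111).
    """
    return {(a, b): abs(r0 - 7 * a) + abs(r1 - 7 * b)
            for a in (0, 1) for b in (0, 1)}
```

For the received pair '101000' (levels 5 and 0) this gives 2 for symbol '10' and 12 for symbol '01'. The largest value over all inputs is 14, so four bits are sufficient.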

While the path metrics are being generated, the branch metric unit is used
to generate the branch metrics of the two new 3-bit data streams. The four-bit branch
metrics are stored in four 7-bit registers (the three MSBs of the registers are made
zero).


5.3.3 Add compare select unit

Figure 5.4: Block diagram of the ACS units


Area-efficient 2-way ACS units have been designed using serial one-bit
adders (refer Fig. 5.4). The two comparators have been implemented by using serial
one-bit subtracters; this ideally suits our requirement as it results in a pipelined
structure. The outputs of the subtracters are latched into two 7-bit registers.
The MSBs of these registers are sent to the decision memory and are also used to
choose between the two competing paths. The adder outputs are held in four 7-bit
registers. The MSB of the surviving path metric is used to detect whether
normalization is required or not. The path metrics are stored in the path metric
memory, with each location of length 7 bits. A control unit holds the values of the
smallest and the largest path metric in two registers, which are accessible to an
external microprocessor.

* Symbols '10' and '01' in 3-bit quantized levels are equal to 111000 and 000111 respectively.

The normalization is initiated only during the next iteration. Normalization
is done by subtracting the smallest path metric from the path metrics routed to the
ACS units during the next iteration.
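A bit-serial ACS of this kind can be modelled in software as below. This is a behavioural sketch, not the synthesised design: the function names are hypothetical, bits are assumed to be processed LSB first, and the two candidate metrics are assumed to differ by less than 64 so that the MSB of the 7-bit difference acts as a sign bit.

```python
def serial_add(a, b, width=7):
    """Add two width-bit words one bit per clock, LSB first, with a carry flip-flop."""
    carry, result = 0, 0
    for i in range(width):
        x, y = (a >> i) & 1, (b >> i) & 1
        result |= (x ^ y ^ carry) << i
        carry = (x & y) | (carry & (x ^ y))
    return result                       # the carry out of the MSB is dropped


def serial_sub(a, b, width=7):
    """Subtract b from a one bit per clock, LSB first, with a borrow flip-flop."""
    borrow, result = 0, 0
    for i in range(width):
        x, y = (a >> i) & 1, (b >> i) & 1
        result |= (x ^ y ^ borrow) << i
        borrow = ((1 - x) & y) | ((1 - (x ^ y)) & borrow)
    return result


def acs(pm0, bm0, pm1, bm1, width=7):
    """Add, compare by subtraction, and select using the MSB of the difference."""
    cand0 = serial_add(pm0, bm0, width)
    cand1 = serial_add(pm1, bm1, width)
    diff = serial_sub(cand0, cand1, width)
    decision = diff >> (width - 1)      # MSB set means cand0 is the smaller metric
    survivor = cand0 if decision else cand1
    return survivor, decision


def normalize(metrics, width=7):
    """Subtract the smallest path metric from every metric."""
    smallest = min(metrics)
    return [serial_sub(m, smallest, width) for m in metrics]
```

For instance, candidates 13 and 17 give a difference of 124 in 7-bit arithmetic, whose MSB is set, so 13 survives. After normalization the metrics 70, 75 and 90 become 0, 5 and 20.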


5.3.4 Path metric memory

Figure 5.5: Address generation for in-place computation of the path metric memory
contents


To accrue benefits from the butterfly structure of the Viterbi algorithm, the
path metric memory has been divided into two blocks, viz. the even parity block and
the odd parity block, of sizes 128 x 7 each. This allows parallel access of the two
path metrics required for the computation of the next path metric. The control
unit scrambles the path metric memory addresses before accessing the memory; this
allows in-place computation, eliminating the need for a temporary buffer. Fig. 5.5
shows an example illustrating our concept.
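
One common way to obtain in-place computation is to rotate the state number by one extra bit position in every iteration when forming the memory address. The sketch below illustrates that idea and checks it against an ordinary double-buffered update. It is an assumption made for illustration, not a reconstruction of the exact scrambling of Fig. 5.5: the rotation scheme, the 16-state size, and the names `ror`, `parity`, `branch_metric`, `update_in_place`, `update_reference`, and `run` are all hypothetical.

```python
V = 4              # state bits for a small demo (a 256-state decoder has 8)
N = 1 << V         # number of states


def ror(x, r, width=V):
    """Rotate the width-bit value x right by r positions."""
    r %= width
    mask = (1 << width) - 1
    return ((x >> r) | (x << (width - r))) & mask


def parity(x):
    """XOR of all address bits: 0 selects the even block, 1 the odd block."""
    return bin(x).count("1") & 1


def branch_metric(prev, nxt, t):
    """Hypothetical stand-in for the branch metric unit."""
    return (7 * prev + 3 * nxt + t) % 5


def update_in_place(mem, t):
    """One trellis step; the metric of state s at step t is at ror(s, t)."""
    half = N // 2
    for j in range(half):
        a0, a1 = ror(j, t), ror(j + half, t)
        assert parity(a0) != parity(a1)   # the two operands are in different blocks
        p0, p1 = mem[a0], mem[a1]
        new = []
        for b in (0, 1):                  # successor states 2j and 2j + 1
            nxt = 2 * j + b
            new.append(min(p0 + branch_metric(j, nxt, t),
                           p1 + branch_metric(j + half, nxt, t)))
        mem[a0], mem[a1] = new            # overwrite the two words just read


def update_reference(pm, t):
    """The same step written with a separate output buffer."""
    half = N // 2
    out = [0] * N
    for nxt in range(N):
        j = nxt >> 1
        out[nxt] = min(pm[j] + branch_metric(j, nxt, t),
                       pm[j + half] + branch_metric(j + half, nxt, t))
    return out


def run(steps):
    """Run both versions and return the metrics in logical state order."""
    mem = list(range(N))                  # arbitrary initial metrics
    ref = list(range(N))
    for t in range(steps):
        update_in_place(mem, t)
        ref = update_reference(ref, t)
    return [mem[ror(s, steps)] for s in range(N)], ref


in_place, reference = run(11)
print(in_place == reference)
```

The two predecessors of a butterfly differ in exactly one bit, and rotation preserves that property, so their addresses always have opposite parity and fall into different blocks. This is consistent with the even/odd parity split described above.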

5.3.5 Decision memory

The traceback memory has been organized as a two-dimensional structure; both the
rows and columns can be accessed. Fig. 5.6 shows the organization of the decision
memory. The number of rows is equal to 2'. Each column corresponds to one
stage in the trellis.

Figure 5.6: Structure of the decision memory


The total memory has been divided into 4 banks with each bank of size 2'^ x
32 dibits² (the decision memory may be viewed as a three-dimensional organization).
Power consumption is reduced at the architectural level by beginning each cycle with
the block address decoding and then enabling the two blocks (one for reading and
the second for writing). This subarray architecture not only results in a faster
implementation, but also consumes less power, since the number of banks active at
a time is limited to two and hence less capacitance is switched per read/write cycle.
The only drawback is the area penalty due to the increased overhead of the decode
logic, the control logic and their routing interconnections [36].

²Each dibit is a group of two bits.

The data is decoded in bursts. The data is available after a traceback
through two banks, and during the traceback through the third bank, the data is
decoded at the rate of 3 bits per input symbol. The decoded data is written into a
two-stack structure which performs both the bit reversal and the elimination
of the bursts. The first decoded symbol appears after the first 128 bits have been
processed by the decoder, thus making the output latency equal to 128. A single
stack structure can be used if read and write operations are interleaved or a dual-port
memory is employed for the stack. However, it complicates the circuit design
and hence was replaced by a double stack structure.
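
The role of the two stacks can be seen in a short behavioural sketch. Traceback produces each burst of decoded bits in reverse time order; one stack is filled with the current burst while the other, filled during the previous burst, is emptied, and popping restores the original order. The function name `two_stack_reorder` and the burst contents are hypothetical, and the timing of the real circuit is not modelled.

```python
def two_stack_reorder(bursts):
    """Reorder traceback bursts with two alternating LIFO stacks.

    bursts: list of lists, each holding one burst of decoded symbols in
    reverse time order (as produced by traceback).
    Returns the symbols in forward time order.
    """
    stacks = [[], []]
    write_sel = 0                # stack currently being written
    output = []
    for burst in bursts:
        stacks[write_sel].extend(burst)     # push the new burst
        read_stack = stacks[1 - write_sel]
        while read_stack:                   # meanwhile, pop the previous burst
            output.append(read_stack.pop())
        write_sel ^= 1                      # swap the roles of the stacks
    last = stacks[1 - write_sel]            # drain the final burst
    while last:
        output.append(last.pop())
    return output


print(two_stack_reorder([[3, 2, 1], [6, 5, 4]]))
```

Because one stack is only written and the other only read at any time, neither needs interleaved accesses or a dual-port memory, which matches the motivation given above for the double stack.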


5.4 Clock Mechanism

A single-phase clocking scheme was preferred because it results in a simpler design
style with reduced routing overheads. Clock signals which drive each module are
obtained from a common source and routed inside the chip to these modules. The
common source clock operates at a high frequency and the slower internal module
clocks are obtained from counters which divide the common source clock.
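
As a rough illustration of deriving a slower module clock from a counter, the sketch below models an even divide-by-N counter cycle by cycle. The function name `divided_clock` and the divide ratios are hypothetical; the actual ratios used in the design are not stated here.

```python
def divided_clock(n_cycles, divide_by):
    """Return the divided clock level for each source clock cycle.

    Assumption: divide_by is even, so the output toggles every
    divide_by / 2 source cycles and has a 50% duty cycle.
    """
    half = divide_by // 2
    count = 0
    level = 0
    levels = []
    for _ in range(n_cycles):
        levels.append(level)
        count += 1
        if count == half:       # counter reached the toggle point
            count = 0
            level ^= 1
    return levels


print(divided_clock(8, 4))
```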

FPGAs, because of their fixed geometric structures, lack the flexibility
of other ASIC implementation technologies with respect to the clock distribution
scheme. To reduce clock skew it is possible to distribute the clock pins during
placement of the logic blocks such that the load capacitances in the clock distribution
tree are balanced. This is possible with Pinnauq by defining a constraint file while
creating the netlist file (XNF file).


5.5 Rapid Prototyping of the Viterbi Decoder
with XILINX FPGA


The Viterbi decoder was implemented using the Xilinx XC4000 [37] (Appendix
B) FPGA series. FPGAs allow shorter design turnarounds and reduced verification
times, which can potentially result in large design savings. The XC4000 series was chosen
not only because it is supported by our synthesis tool Pinnauq, a high-level synthesis
tool, but also because of its advanced and flexible architecture. The generation of
the XNF description (or netlist) of our design was done with Pinnauq.

Once the proper XNF description has been generated, its mapping to
the XC4000 series is straightforward using the Xilinx-supplied tools for file format
conversion. The design is translated to a logic cell array (LCA)³ file from where a
serial bitstream file for configuring the CLBs is generated.

The Xilinx XC4000 series contains special-purpose hardware to efficiently
implement fast carry logic as found in adders, subtracters, counters and other
related function blocks. Normally little algorithmic advantage can be gained by
substituting a superior description for such a function block. It is also difficult for
HDL compilers to use special-purpose features which are available in FPGAs under
certain conditions, such as fast carry look-ahead logic or the built-in RAM.

Choosing the right implementation style for FPGA design can cause a huge
difference in resource utilization. We tried various algorithmic descriptions for our
functional blocks, e.g. serial one-bit adders and subtracters. The results obtained
are given in Table 5.1.


Description    FMAPS    HMAPS    IOBS    Other Resources
ADD (XOR)      2        0        6       0
ADD (+)        1        0        6       0


Table 5.1: Resource utilization for different styles

Xilinx provides a partial solution to this problem by supplying a library of
adders, subtracters, counters, and comparators. In Pinnauq, these libraries are used
to implement complex functional units.

³XC4000 in our case.

Pinnauq uses X-BLOX library functional units [38] by including references
to the X-BLOX modules in the Verilog code. During the processing by XACT,
X-BLOX is invoked as a module generator to synthesize appropriate functional units.
Fig. 5.7 shows how an X-BLOX description has been included in the Verilog
description of one-bit adders.


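module one_bit_adder(a, b, reset, clock, sum);

    input  [0:0] a, b, reset, clock;
    output [0:0] sum;

    reg  [0:0] carry_in;    // carry held between successive bit times
    wire [0:0] sum;
    wire [0:0] carry_out;

    // X-BLOX adder/subtracter instance; ADD_SUB = 1 selects addition
    ADD_SUB_1_UBIN adder1(.A(a), .B(b),
        .C_IN(carry_in), .FUNC(sum), .ADD_SUB(1'b1), .C_OUT(carry_out));

    // carry register: cleared while reset is low, otherwise loaded on the clock
    always @(posedge reset or posedge clock)
        if (reset == 1'b0)
            carry_in = 1'b0;
        else if (clock == 1'b1)
            carry_in = carry_out;
endmodule


Figure 5.7: Including X-BLOX instantiation in a Verilog file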
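
The behaviour of a serial one-bit adder of this kind can be checked with a small software model: the operands are presented one bit per clock, least significant bit first, and the carry is held in a register between bits. This is a behavioural sketch only; the names `serial_add`, `to_bits`, and `from_bits` are hypothetical and the model says nothing about the X-BLOX implementation itself.

```python
def to_bits(value, width):
    """Return value as a list of width bits, least significant bit first."""
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits):
    """Inverse of to_bits."""
    return sum(bit << i for i, bit in enumerate(bits))


def serial_add(a_bits, b_bits):
    """Add two equal-length LSB-first bit lists using a one-bit full adder.

    Returns (sum_bits, final_carry).
    """
    carry = 0                                   # carry register after reset
    sum_bits = []
    for a, b in zip(a_bits, b_bits):
        sum_bits.append(a ^ b ^ carry)          # sum output for this bit time
        carry = (a & b) | (carry & (a ^ b))     # carry stored for the next bit
    return sum_bits, carry


# 7-bit example matching the path metric word length.
bits, carry = serial_add(to_bits(45, 7), to_bits(27, 7))
print(from_bits(bits), carry)
```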


While using the X-BLOX library helps optimize some circuits,
most of the circuits cannot be optimized to use these features. A solution is to
include X-BLOX/Xilinx design library elements as components in Verilog and use
either available circuits in the X-BLOX libraries, or to generate one's own circuits
with XACT or a schematic entry tool to be included in the design. This feature was
readily supported by Pinnauq and we optimized our circuits by including X-BLOX
design libraries in our Verilog description.
Chapter 6


Conclusions And Future Work

6.1 Conclusion

The Viterbi decoder is one of the most important blocks in a CDMA modem. In this
thesis we have designed and implemented the Viterbi decoder. This has been
targeted for FPGA implementation. The methodology that has been developed is not
specific to just the Viterbi decoder but can be used for implementation of any other
Digital Signal Processing (DSP) algorithm. We believe the same approach can be
used in the design and implementation of the entire CDMA modem chip. Some
methods for accelerating the decoding rate have been developed and these methods
can be used as guidelines in any future implementation of the Viterbi decoder. Our
aim was to develop an area-efficient 19.2 kbps, 256-state Viterbi decoder, but the
design developed can cater to higher input data rates. Some of the existing features
such as input bit sequence synchronization, bit error measurement and self-testing
features have been incorporated.

6.2 Future work

The implementation methodology used in the design of a Viterbi decoder can be
extended to the complete design of a CDMA modem chip. The design was targeted
for FPGAs. FPGA area optimization techniques are different from ASIC-based
design optimization techniques. This feature of the FPGA design needs to be
closely looked into. Presently there are no guidelines available for writing HDL
descriptions which can result in an optimal FPGA implementation. We believe that
this will be very much dependent on the synthesis tool used and on the FPGA
architecture. For example, using Xilinx X-BLOX libraries seems to result in an
area-efficient implementation; however, the impact of including X-BLOX descriptions in
the high-level RTL description needs to be closely assessed. Minor restructuring of
the RTL description can lead to tremendous variations in terms of the number of
logic blocks used. This feature needs to be investigated.

The Viterbi decoder design, because of its memory requirements, needs more
than one FPGA for its implementation and this necessitates partitioning of the
design. Because of high interconnection delays, which can slow the decoding rate,
partitioning a single design into a multi-FPGA targeted design is a real challenge.
Appendix A


Spread Spectrum Techniques


Figure A.1: (a) Direct sequence system (b) Frequency hopping system

A spread spectrum system is one in which the transmitted signal is spread
over a wide frequency band, much wider than the minimum bandwidth required to
transmit the information being sent.

Three general techniques are recognized as spread spectrum signalling methods:

1. Modulation of a carrier by a digital code sequence whose chip rate is much
higher than the information signal bandwidth. Such systems are called direct
sequence modulated systems.

2. Carrier frequency is shifted in discrete increments using a pattern dictated
by the code sequence. These are called frequency hopping systems. The
transmitter jumps from frequency to frequency within some predetermined
set; the order of frequency usage is determined by a code sequence.

3. Pulsed FM or chirp modulation, in which a carrier is swept over a wide band
during a given pulsed interval.
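
The first technique can be illustrated at baseband with a few lines of code: every data bit is combined with a faster chip sequence, and the receiver recovers the bit by correlating with the same sequence. This is a simplified sketch, not a model of any system in this thesis; the function names `spread` and `despread`, the 7-chip code, and the majority-vote decision are assumptions made for the example, and carrier modulation is omitted.

```python
CODE = [1, 1, 1, 0, 0, 1, 0]    # example 7-chip code (an assumption)


def spread(data_bits, code=CODE):
    """XOR every data bit with each chip of the code sequence."""
    return [bit ^ chip for bit in data_bits for chip in code]


def despread(chips, code=CODE):
    """Compare each code-length block with the code and take a majority vote."""
    n = len(code)
    bits = []
    for i in range(0, len(chips), n):
        block = chips[i:i + n]
        mismatches = sum(c ^ k for c, k in zip(block, code))
        bits.append(1 if mismatches > n // 2 else 0)
    return bits


data = [1, 0, 1, 1]
tx = spread(data)               # 7 chips are sent for every data bit
rx = list(tx)
rx[3] ^= 1                      # a single chip error in the channel
print(despread(rx) == data)
```

The chip rate here is seven times the bit rate, which is the bandwidth expansion referred to in the definition above, and the redundancy lets the receiver tolerate isolated chip errors.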


Appendix B


Using FPGAs for
Application-Specific Digital
Signal Processing Application -
Viterbi Decoder: A Case Study

B.1 Introduction

FPGAs are an alternative and a viable solution for implementing DSP algorithms
which have traditionally been implemented using programmable DSP chips or
hard-wired application-specific integrated circuits based on the standard cell or
mask-programmable gate array design methodology. FPGAs combine the flexibility of a
general-purpose DSP and the speed, density, and low cost of an ASIC
implementation. FPGAs, while retaining the advantages of the customized functionality
that an ASIC provides, avoid the expensive development costs and the inability
to make subsequent design modifications associated with ASICs.
B.2 FPGAs vs. Traditional Approaches


Traditionally, DSP functions have been implemented using a general-purpose
Digital Signal Processor (DSP) or by using ASIC technology. Whenever the application
requires timing performance beyond the abilities of current-generation Digital
Signal Processors, or when the expected volume justifies a semi-custom solution,
these functions are implemented using a standard cell based design methodology or a
mask-programmable gate array technology.

However, FPGAs provide an alternative solution that combines the best
of both DSP and ASIC technologies without their respective limitations. Like a
general-purpose DSP, FPGAs are programmable and can be modified in situ after
production. FPGAs have a flexible architecture that can be configured for a specific
DSP function.

B.3 Using FPGAs

FPGAs are pre-wired circuits that are programmed by the users in situ, instead of
during the fabrication. This is achieved by downloading a user-defined configuration
in the form of a binary bit stream.

The big difference between a gate array and an FPGA-based ASIC
implementation is that the user-specific customizing fabrication steps are eliminated in
an FPGA. A gate array undergoes at least two steps in a silicon foundry. This
necessitates a careful verification of the implementation design to avoid expensive design and
foundry iterations.

For FPGAs, a designer may specify his implementation by a netlist
represented by a schematic diagram and then rely on an automatic place and route tool
to map and interconnect components in the netlist implementation into the FPGA
sites. An alternative to using a schematic diagram for the design description is to
use a Hardware Description Language (HDL) such as Verilog or VHDL. Synthesis
tools are then used to translate this HDL description into the netlist representation
using components from the Gate Array's cell library.

Figure B.1: FPGA design flow

Fig. B.1 shows a typical FPGA design flow where the designer has specified
the design with either a schematic or HDL description.

B.4 Advantages of FPGAs

The advantages of FPGAs are:

• Design upgrades do not require replacement of the FPGA. Reprogrammability
is the solution; some FPGAs are reprogrammable.

• Reprogrammable FPGAs can be dynamically reconfigured within the system.
Designs which can adapt to changing conditions can be built with FPGAs.
Reconfigurability is a strong reason in favour of FPGAs.

• In addition, FPGAs avoid the need to develop accompanying test programs to
catch manufacturing defects.
B.5 Xilinx FPGAs


Xilinx FPGA devices possess the following features which enable implementation of
high-performance designs [39]:

• Flexible logic blocks with bit-level arithmetic features - allow Distributed
Arithmetic implementations of DSP algorithms.

• Distributed RAM and ROM - increase operand bandwidth.

• A register-rich architecture - enables a high degree of pipelining, leading to
increased performance.

Configuring a Xilinx FPGA starts with a design, usually a block diagram
or a schematic entered through a schematic capture tool. The schematic is then
automatically converted into the Xilinx Netlist Format (XNF). The XACT software,
in a Xilinx development system, first partitions the design into logic blocks, then
finds a near-optimal placement for each block and finally selects the interconnect
routing. The user has the flexibility to impose specific constraints on the above
processes of Partitioning, Placement and Routing (PPR).

Once the design is complete it is documented in a Logic Cell Array (LCA)
file, from which the FPGA-configuring serial bitstream file can be generated. Inside
the device these configuration bits control or define the combinatorial circuitry,
flip-flops, interconnect structure, and the I/O buffers.

Xilinx also provides a module generator called X-BLOX for its new
generations of FPGAs (e.g. XC4000, 4000E, etc.). The module generator can be used
to implement data path operators efficiently. Besides X-BLOX, Xilinx provides a
library of macros known as Unified Library components to speed up the
implementation process.

Fig. B.2 shows the Configurable Logic Block (CLB) element of the XC4000
series. Each CLB packs a pair of flip-flops and two independent 4-input function
generators. A powerful and flexible CLB surrounded by a versatile set of routing
resources contributes to a high cell usage in a Xilinx FPGA design.

Figure B.2: The configurable logic block for the XC4000


B.6 Pinnauq - A tool for FPGA Design

Pinnauq is a high-level synthesis tool for the Verilog HDL developed by Silicon
Automation Systems (India) Pvt. Ltd., Bangalore. The tool is meant for synthesising
designs described in Verilog and targeted for FPGAs. The tool supports
a subset of the instructions available in Verilog.

The version of the tool used during this thesis supports transformations
targeted towards Xilinx FPGAs. This version of Pinnauq supports the XC2000,
XC3000, XC3100 and XC4000 versions from the Xilinx FPGA family.


B.7 A case study - Viterbi Decoder

A Viterbi decoder, though not requiring any multiply operation, is however an
example of a DSP algorithm. The real-time mathematical processing of the input
data stream makes it unsuitable for a DSP processor based design. The main limiting
factor in the DSP-based design is the design and timing of the external SRAM. Also,
each Add/Subtract and Multiplex stage must be performed sequentially in the DSP.


Figure B.3: Viterbi block diagram

The algorithm has been shown graphically in Fig. B.3. This algorithm is
well suited for implementation based on FPGAs. The ability of FPGAs to process
parallel data paths enables the implementation of the parallel structures for the four
ADD/SUB blocks in the first stage and the two SUB blocks in the second stage.
The two MUX blocks take advantage of the ability to register and to hold data
until needed, with no additional clock cycles.

However, higher constraint lengths call for partitioning of the design before
mapping it to the FPGAs. For our design, with constraint length K=9, we had
to partition our ACS unit, and the memory banks had to be mapped separately.
This can cause some unforeseen delays which need to be taken into account while
simulating the Verilog code.